Abstract- Google’s Project Loon was launched in 2013
with the aim of providing Internet access to rural and remote
areas. In the Loon network, balloons travel around the Earth
and bring access points to users who cannot connect directly
to the global wired Internet. The signals from the users are
transmitted through the balloon network to base stations
connected to an Internet service provider (ISP) on Earth.
Transmitting and receiving data consumes a certain amount of
energy from the balloon, yet the balloons cannot be powered by
a stable power source or by frequent battery replacement.
Instead, the balloons can harvest energy from natural sources,
e.g., solar energy, or from radio frequency (RF) signals when
equipped with appropriate circuits. However, such energy
sources are often dynamic, so using the harvested energy
efficiently is the main goal of this paper. We study the
optimal energy allocation problem for the balloons such that
network performance is optimized and the revenue for service
providers is maximized. We first formulate the stochastic
optimization problem as a Markov decision process and then
apply a simulation-based learning algorithm to obtain optimal
policies for the balloons. Numerical results obtained by
extensive simulations show the efficiency and convergence of
the proposed learning algorithm.

Keywords- Internet in the sky, Google Loon Project, Markov 
decision process. 

I. Introduction 

After more than 40 years of development, the Internet has
created a revolution in human communication because it
allows people to access and exchange information efficiently.
Although the Internet is widely used, approximately 60-70%
of people worldwide do not have Internet access, as reported by
the International Telecommunication Union in June 2013.
This stems from the fact that many regions, such as Africa,
Asia, and the Pacific, cannot offer Internet connections due to
geographical and infrastructure issues. Therefore, the idea
of providing Internet connections via wireless networks has
become more and more popular.

In wireless Internet, mobile users connect to the Internet
service provider (ISP) through base stations or access points.
However, deploying base stations at every location on
the Earth, e.g., over oceans and mountains, is impractical.
Therefore, the idea of providing Internet from the sky was
introduced. The early version is based on satellites, which
suffer from high cost and long transmission delay. As
a result, a cheaper and faster alternative, i.e., the Google Loon
project, was proposed. In the Loon project, access points are
placed on balloons flying at an altitude of about 20 km,
which is safe from bad weather and flights. The balloons
travel around the Earth and form a network of access points
for Internet users in remote places. When receiving data from
a user, the balloon finds the shortest route to transfer the
data to the nearest base station on the ground, from which the
data is forwarded to an ISP.



Fig. 1: The model of wireless network in the sky. 


Data transmission and reception by the access points on the
balloon consume energy, which requires a continuous supply
to sustain operation. The only viable energy source for the
balloon is energy harvesting, e.g., from solar energy or
radio frequency (RF) signals. The harvested energy is stored in
the energy storage of the access point and used for
data transmission and reception. However, batteries carried
by balloons are limited in size, and the energy harvested
from solar or RF sources is random. Moreover, the balloons may
have to serve data transfer requests from different types of
users, e.g., other balloons, users on the ground, satellites, or
aircraft, with different quality of service (QoS) requirements.
Therefore, energy management for the balloons is an important
issue.

In this paper, we aim to find an optimal admission control
policy for the access points deployed on the balloons. The goal
is to ensure high energy efficiency while maximizing the profit
of service providers. We formulate the energy allocation
optimization problem as a Markov decision process (MDP). To
obtain the optimal policy, we apply a simulation-based learning
algorithm built on the policy gradient method.
The proposed learning algorithm not only avoids the curse
of dimensionality caused by the explosion of the state
and action spaces, but also eliminates the need for complete
knowledge of the model, which may not be available
in an unpredictable environment. Numerical results
show the convergence as well as the efficiency of the proposed
learning algorithm.

II. System Model 



We consider the energy management problem for a balloon
with an access point, as shown in Fig. 2. Specifically, the
access point receives data from users and transmits it onward.
The access point has a battery to store energy harvested from
solar and RF sources. When a request is sent to the access point,
it checks the amount of energy remaining in the battery
and applies an admission control policy to decide whether
the request is accepted or rejected. If the request is
accepted, a certain amount of energy from the battery is
used to receive the data and transmit it to the next hop.
Different requests have different QoS requirements. Therefore,
we divide requests into three classes: requests from other
balloons (class-1), requests from users on the ground
(class-2), and requests from satellites or aircraft (class-3).
The arrival processes of requests from class-1, class-2, and
class-3 are assumed to be Poisson with
mean rates $\lambda_b$, $\lambda_c$, and $\lambda_s$, respectively. When a request is
accepted, the balloon receives an immediate reward (i.e.,
revenue). The immediate revenues for accepting requests from
class-1, class-2, and class-3 are $r_b$, $r_c$, and $r_s$, respectively.

In the Loon network, balloons are assumed to be equipped
with solar panels [4] for harvesting energy from sunlight.
Additionally, we assume that the balloons can be equipped
with a rectenna [5] to harvest energy from RF waves. We
assume that the energy arrivals from both the solar panel and
the RF rectenna are Poisson with mean rate $\lambda_e$,
and that the successful energy harvesting probability is $p_e^{su}$. The
maximum capacity of the battery is $E$. At each time epoch,
when the access point receives a request, it consults
the admission control policy, which decides whether
to accept or reject the request based
on the request’s class and the current energy level in the battery.
In this paper, we are interested in maximizing the profit for
the service provider in terms of the average reward of the
balloon.

III. Problem Formulation 

In this section, we formulate the optimization problem
as a Markov decision process (MDP) and study a learning
algorithm to obtain the optimal policy for the balloons.


A. MDP Framework 

An MDP is defined by a tuple $\langle \mathcal{S}, \mathcal{A}, P, R \rangle$, where
$\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the transition
probability function, and $R$ is the reward function. The state
space $\mathcal{S}$ of the admission control problem for the access point on a
balloon is defined as follows:

$$\mathcal{S} = \{ s = (e, x) : e \in \mathcal{E};\ x \in \mathcal{X} \}, \tag{1}$$

where $\mathcal{E} = \{0, 1, \ldots, E\}$ is the energy state space, whose
elements represent energy levels in the battery, and $\mathcal{X} =
\{x_b, x_c, x_s, x_e\}$ is the set of events, in which $x_b$, $x_c$, and $x_s$
are the events that a request arrives from another access point,
a user, and a satellite, respectively, and $x_e$ is the event of an
energy arrival.

When the system is at energy level $e$ and an event $x$ happens, an
accept/reject decision must be made. Thus, we can define the
action space as follows:

$$\mathcal{A} = \{ a : a \in \{0, 1\} \}, \tag{2}$$

where

$$a = \begin{cases} 1, & \text{if the arriving request is accepted,} \\ 0, & \text{if the arriving request is rejected.} \end{cases}$$

To derive the transition probability function $P(s, a, s')$,
we consider a discrete-time system obtained with the uniformization
technique [6], with the uniformization parameter $u$ given by

$$u = \lambda_b + \lambda_c + \lambda_s + \lambda_e. \tag{3}$$

Based on the uniformization parameter $u$, we can determine
the probabilities of the events as follows. At energy state $e$, the
probabilities of a request arriving from another balloon, a user,
and a satellite or an aircraft are $\lambda_b/u$, $\lambda_c/u$, and $\lambda_s/u$, respectively.
The probability of an energy arrival is $\lambda_e/u$. From these, we could derive
the transition probability matrix of the system. However,
doing so requires the environment parameters,
e.g., the successful energy harvesting probability and the request
arrival rates. These parameters are generally not known in advance,
and building the model with complete information may not be
possible. Therefore, we apply a learning algorithm based on the
simulation-based method. The main idea of the simulation-based
method is to use a “simulator” that mimics
the environment by generating the environment parameters (e.g.,
the successful energy harvesting probability and the arrival rates).
We then use the parameters from the simulations to derive the
admission control policy for the access point. With the
simulation-based method, the transition probability function
can be defined as follows:

$$P(s, a, s') = p_{\mathrm{env}}\, p(s)\, p(a), \tag{4}$$

where $p_{\mathrm{env}}$ is an environment parameter (e.g., the successful
energy harvesting probability), $p(s)$ is the probability that the
system is at state $s$, and $p(a)$ is the probability that action $a$
is taken.
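
As a small illustration of the uniformization step, the following sketch computes the per-epoch event probabilities from the arrival rates. The numeric rates are the ones listed in the experiment setup (60, 70, and 10 requests per hour, and 110 energy arrivals per hour); the function name `event_probabilities` and the event labels are hypothetical names chosen only for this example.

```python
def event_probabilities(rates):
    """Return the per-epoch probability of each event under uniformization.

    `rates` maps an event label to its arrival rate; the uniformization
    parameter u is the sum of all rates, and each probability is rate / u.
    """
    u = sum(rates.values())
    if u <= 0:
        raise ValueError("at least one rate must be positive")
    return {event: rate / u for event, rate in rates.items()}


# Rates from the experiment setup; the labels are illustrative assumptions.
rates = {"balloon": 60.0, "ground": 70.0, "satellite": 10.0, "energy": 110.0}
probs = event_probabilities(rates)
for event, p in probs.items():
    print(f"{event}: {p:.2f}")
```

Sampling one event per epoch from this distribution is enough to drive a discrete-time simulator of the continuous-time arrival processes.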

When a request $x$ arrives at the access point and
the request is accepted, the balloon receives an immediate
reward $r_x$ corresponding to the type of the request. Otherwise,
the access point gains nothing, i.e.,

$$R(s, a) = \begin{cases} r_x, & \text{if } x \in \{x_b, x_c, x_s\},\ a = 1, \text{ and } e + x \in \mathcal{E}, \\ 0, & \text{otherwise.} \end{cases}$$


Note that, for the case $x = x_e$, the access point always
receives the energy if the battery is not full. However, there is no
reward for such an event.
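
The state transition and reward described above can be sketched as a single simulator step. This is only an illustration under stated assumptions: one unit of energy is spent per accepted request and one unit is gained per successful harvest (as in the experiment setup), and a request that arrives at an empty battery earns nothing even if accepted. The function `step`, the event labels, and the `REWARDS` table are hypothetical names; the reward values 5, 2, and 3 are those of the experiment setup.

```python
import random

# Rewards per class from the experiment setup; labels are illustrative.
REWARDS = {"balloon": 5.0, "ground": 2.0, "satellite": 3.0}


def step(energy, event, accept, capacity=10, p_harvest=0.9, rng=random):
    """Apply one event to the battery and return (next_energy, reward)."""
    if event == "energy":
        # Energy arrival: no decision and no reward; the harvest may fail.
        if energy < capacity and rng.random() < p_harvest:
            return energy + 1, 0.0
        return energy, 0.0
    if accept and energy >= 1:
        # Serving the request costs one unit and earns the class reward.
        return energy - 1, REWARDS[event]
    # Rejected, or not enough energy to serve the request (assumption).
    return energy, 0.0


print(step(3, "ground", accept=True))    # request served
print(step(0, "balloon", accept=True))   # empty battery, nothing earned
```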


B. MDP with Parameterization 

We consider a parameterized randomized policy.
With the parameterized randomized policy, when
a request arrives at the access point, the request is
accepted (i.e., $a = 1$) with a probability defined as follows:

$$\mu_\Theta(s, a) = \frac{1}{1 + \exp\big(1.5(\theta_x - e)\big)}, \tag{5}$$

where $\Theta$ is the parameter vector of the learning algorithm, $e$ is
the current energy level of the battery, and $\theta_x$ is the parameter
for requests from event $x$. Additionally, the parameterized
randomized policy $\mu_\Theta(s, a)$ must be nonnegative and satisfy
the following condition:

$$\sum_{a \in \mathcal{A}} \mu_\Theta(s, a) = 1. \tag{6}$$
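
To make the shape of this policy concrete, the sketch below evaluates the logistic acceptance probability for a range of energy levels. The slope 1.5 comes from the policy definition; the helper names and the value `theta_x = 4.0` are hypothetical and used only for illustration. The logistic function is written in a numerically stable form so that large parameter values do not overflow.

```python
import math


def accept_probability(theta_x, energy, slope=1.5):
    """Return 1 / (1 + exp(slope * (theta_x - energy)))."""
    v = slope * (energy - theta_x)
    # Numerically stable logistic function (avoids overflow in exp).
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    ev = math.exp(v)
    return ev / (1.0 + ev)


def action_distribution(theta_x, energy):
    """Return (P[reject], P[accept]); the two entries sum to one."""
    p = accept_probability(theta_x, energy)
    return 1.0 - p, p


theta_x = 4.0  # hypothetical parameter value, for illustration only
for e in range(0, 11, 2):
    print(e, round(accept_probability(theta_x, e), 3))
```

The acceptance probability crosses 0.5 exactly where the energy level equals the class parameter, so each parameter acts as a soft energy threshold for its request class.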

With the randomized parameterized policy, the transition
probability function is parameterized as follows:

$$P_\Theta(s, s') = \sum_{a \in \mathcal{A}} \mu_\Theta(s, a) P(s, a, s'). \tag{7}$$

Similarly, we can parameterize the immediate reward function
as follows:

$$R_\Theta(s) = \sum_{a \in \mathcal{A}} \mu_\Theta(s, a) R(s, a). \tag{8}$$


We aim to maximize the average reward under the randomized
parameterized policy, denoted by $\psi(\Theta)$, which is defined
as follows:

$$\psi(\Theta) = \lim_{t \to \infty} \frac{1}{t} E_\Theta\!\left[ \sum_{k=0}^{t} R_\Theta(s_k) \right], \tag{9}$$

where $s_k$ is the system state at step $k$ and $E_\Theta[\cdot]$ is the expected
reward of the system.

We then make some assumptions as follows: 


Assumption 1. The Markov chain corresponding to every
$P \in \mathcal{P}$ is aperiodic. Furthermore, there exists a state $s^*$ that
is recurrent for every such Markov chain.

Assumption 2. For every pair of states $s, s' \in \mathcal{S}$, the functions
$P_\Theta(s, s')$ and $R_\Theta(s)$ are bounded, twice differentiable, and
have bounded first and second derivatives.

Assumption 1 implies that the system has the Markov property,
and Assumption 2 guarantees that the transition probability
function and the average reward function depend smoothly
on the parameter vector $\Theta$ after they are parameterized by $\Theta$.
Assumption 2 is necessary when we use the policy gradient
method to adjust the vector $\Theta$. Under Assumption 2, the average
reward $\psi(\Theta)$ is well defined for every $\Theta$ and does not depend
on the initial state. Furthermore, we have the following balance
equations:

$$\sum_{s \in \mathcal{S}} \pi_\Theta(s) P_\Theta(s, s') = \pi_\Theta(s'), \qquad \sum_{s \in \mathcal{S}} \pi_\Theta(s) = 1, \tag{10}$$

where $\pi_\Theta(s)$ is the steady-state probability of state $s$ under the
parameter vector $\Theta$, and thus the average reward function can
also be written as follows:

$$\psi(\Theta) = \sum_{s \in \mathcal{S}} \pi_\Theta(s) R_\Theta(s). \tag{11}$$
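
For a chain small enough to write down explicitly, the balance equations and the resulting average reward can be evaluated numerically. The sketch below does this by power iteration on a hypothetical two-state chain; the matrix `P` and reward vector `R` are made-up illustrative values, not the balloon model, and the iteration relies on the chain being aperiodic with a single recurrent class.

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Solve pi = pi P with sum(pi) = 1 by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def average_reward(P, R):
    """Average reward: sum over states of pi(s) * R(s)."""
    pi = stationary_distribution(P)
    return sum(p * r for p, r in zip(pi, R))


# Hypothetical two-state example (rows of P sum to one).
P = [[0.9, 0.1],
     [0.5, 0.5]]
R = [1.0, 0.0]
pi = stationary_distribution(P)
psi = average_reward(P, R)
print([round(p, 4) for p in pi], round(psi, 4))
```

For the full model the state space is too large for this direct computation to be attractive, which is one motivation for the simulation-based estimate used in the remainder of the section.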


C. Policy Gradient Method 

We can now apply the policy gradient method to update
the parameter vector $\Theta$ as follows:

$$\Theta_{k+1} = \Theta_k + \gamma_k \nabla\psi(\Theta_k), \tag{12}$$

where $\gamma_k$ is a step size parameter. In the policy gradient
method, we start with an initial parameter vector $\Theta_0$, and
the parameter vector $\Theta$ is then updated iteratively based
on (12). Under Assumption 2 and an appropriate step size, it
was proved in [11] that $\lim_{k \to \infty} \nabla\psi(\Theta_k) = 0$. That is, the
average reward $\psi(\Theta_k)$ converges almost surely.

We now propose Proposition 1 to calculate the gradient of
the average reward $\psi(\Theta)$.

Proposition 1. Let Assumption 1 and Assumption 2 hold. Then

$$\nabla\psi(\Theta) = \sum_{s \in \mathcal{S}} \pi_\Theta(s) \Big( \nabla R_\Theta(s) + \sum_{s' \in \mathcal{S}} \nabla P_\Theta(s, s')\, d_\Theta(s') \Big), \tag{13}$$

where $d_\Theta(s)$ is the differential reward at state $s$, defined as
follows:

$$d_\Theta(s) = E_\Theta\!\left[ \sum_{k=0}^{T-1} \big( R_\Theta(s_k) - \psi(\Theta) \big) \,\Big|\, s_0 = s \right], \tag{14}$$

where $T = \min\{k > 0 \mid s_k = s^*\}$ is the first future time that
the state $s^*$ is visited. Because of limited space, the proof of
Proposition 1 is omitted here and can be found in the literature.


D. Simulation-based learning algorithm 

We update the parameter vector $\Theta$ iteratively based on (12),
with the gradient of the average reward calculated
from Proposition 1. However, it is not easy to calculate the
terms in (13). Additionally, when the state space and action
space are large, it is intractable to calculate the exact value of
the gradient of the average reward function. Therefore, in this
paper, we consider an approach that estimates the gradient
of the average reward function, so that the parameter vector
$\Theta$ can be adjusted in an online manner.

From (6), we have $\sum_{a \in \mathcal{A}} \mu_\Theta(s, a) = 1$, so we derive
$\sum_{a \in \mathcal{A}} \nabla\mu_\Theta(s, a) = 0$. From (8), we have:

$$\begin{aligned} \nabla R_\Theta(s) &= \sum_{a \in \mathcal{A}} \nabla\mu_\Theta(s, a) R(s, a) \\ &= \sum_{a \in \mathcal{A}} \nabla\mu_\Theta(s, a) \big( R(s, a) - \psi(\Theta) \big). \end{aligned} \tag{15}$$
aG^t 







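This follows from the fact that $\sum_{a \in \mathcal{A}} \nabla\mu_\Theta(s, a) = 0$. Moreover,
we have

$$\sum_{s' \in \mathcal{S}} \nabla P_\Theta(s, s')\, d_\Theta(s') = \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \nabla\mu_\Theta(s, a) P(s, a, s')\, d_\Theta(s'). \tag{16}$$

Therefore, along with Proposition 1, we derive the following
results:

$$\begin{aligned}
\nabla\psi(\Theta) &= \sum_{s \in \mathcal{S}} \pi_\Theta(s) \Big( \nabla R_\Theta(s) + \sum_{s' \in \mathcal{S}} \nabla P_\Theta(s, s')\, d_\Theta(s') \Big) \\
&= \sum_{s \in \mathcal{S}} \pi_\Theta(s) \Big( \sum_{a \in \mathcal{A}} \nabla\mu_\Theta(s, a) \big( R(s, a) - \psi(\Theta) \big) + \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \nabla\mu_\Theta(s, a) P(s, a, s')\, d_\Theta(s') \Big) \\
&= \sum_{s \in \mathcal{S}} \pi_\Theta(s) \sum_{a \in \mathcal{A}} \nabla\mu_\Theta(s, a) \Big( \big( R(s, a) - \psi(\Theta) \big) + \sum_{s' \in \mathcal{S}} P(s, a, s')\, d_\Theta(s') \Big) \\
&= \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \pi_\Theta(s) \nabla\mu_\Theta(s, a)\, q_\Theta(s, a),
\end{aligned} \tag{17}$$

where

$$\begin{aligned}
q_\Theta(s, a) &= \big( R(s, a) - \psi(\Theta) \big) + \sum_{s' \in \mathcal{S}} P(s, a, s')\, d_\Theta(s') \\
&= E_\Theta\!\left[ \sum_{k=0}^{T-1} \big( R(s_k, a_k) - \psi(\Theta) \big) \,\Big|\, s_0 = s,\ a_0 = a \right].
\end{aligned}$$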


Here again, $T$ is the first future time that the recurrent state
$s^*$ is visited. $q_\Theta(s, a)$ can be interpreted as the differential
reward if action $a$ is taken at state $s$ under policy $\mu_\Theta$. Note
that $d_\Theta(s)$ is the differential reward at state $s$, which is different
from the differential reward at state $s$ under action $a$, i.e., $q_\Theta(s, a)$.
We then present Algorithm 1 to update the parameter vector
$\Theta$ at the visits to the recurrent state $s^*$.


Algorithm 1 Algorithm to update the parameter vector $\Theta$ at visits
to the recurrent state $s^*$

At the time $t_{m+1}$ of the $(m+1)$th visit to state $s^*$, we
update the parameter vector $\Theta$ and the estimated average
reward $\tilde{\psi}$ as follows:

$$\Theta_{m+1} = \Theta_m + \gamma_m F_m(\Theta_m, \tilde{\psi}_m),$$

$$\tilde{\psi}_{m+1} = \tilde{\psi}_m + \eta \gamma_m \sum_{n=t_m}^{t_{m+1}-1} \big( R(s_n, a_n) - \tilde{\psi}_m \big),$$

where

$$F_m(\Theta_m, \tilde{\psi}_m) = \sum_{n=t_m}^{t_{m+1}-1} \tilde{q}_{\Theta_m}(s_n, a_n) \frac{\nabla\mu_{\Theta_m}(s_n, a_n)}{\mu_{\Theta_m}(s_n, a_n)},$$

$$\tilde{q}_{\Theta_m}(s_n, a_n) = \sum_{k=n}^{t_{m+1}-1} \big( R(s_k, a_k) - \tilde{\psi}_m \big).$$


In Algorithm 1, $\eta$ is a positive scalar and $\gamma_m$ is a step
size parameter. We derive the following convergence result
for Algorithm 1.


Proposition 2. Let Assumption 1 and Assumption 2 hold, and
let $(\Theta_m)$ be the sequence of parameter vectors generated by
Algorithm 1 with a step size parameter $\gamma_m$ satisfying
Assumption 3. Then $\psi(\Theta_m)$ converges and

$$\lim_{m \to \infty} \nabla\psi(\Theta_m) = 0$$

with probability 1.

The proof of Proposition 2 is omitted here and can be found in the literature.

Assumption 3. The step size $\gamma_m$ is deterministic, nonnegative,
and satisfies the following conditions:

$$\sum_{m=1}^{\infty} \gamma_m = \infty, \qquad \sum_{m=1}^{\infty} \gamma_m^2 < \infty.$$
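
One common choice that meets this kind of step-size condition is the harmonic sequence, where the m-th step size is 1/m. This particular sequence is an assumption made for illustration; the paper does not state which step size it uses. The sketch below tracks the partial sums of the step sizes and of their squares: the first keeps growing (roughly like the logarithm of the number of terms), while the second stays bounded.

```python
def partial_sums(step_size, n_terms):
    """Return (sum of step sizes, sum of squared step sizes) over n_terms."""
    total, total_sq = 0.0, 0.0
    for m in range(1, n_terms + 1):
        g = step_size(m)
        total += g
        total_sq += g * g
    return total, total_sq


def harmonic(m):
    """Illustrative step size 1/m (an assumption, not taken from the paper)."""
    return 1.0 / m


for n in (10, 1_000, 100_000):
    s, s2 = partial_sums(harmonic, n)
    print(n, round(s, 3), round(s2, 5))
```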

In Algorithm 1, to update the value of the parameter vector
$\Theta$ at the next visit to the state $s^*$, we need to store
all values of $\tilde{q}_{\Theta_m}(s_n, a_n)$ and $\nabla\mu_{\Theta_m}(s_n, a_n) / \mu_{\Theta_m}(s_n, a_n)$ between two
successive visits. However, this method could result in slow
processing. Therefore, we modify Algorithm 1 to improve its
efficiency. First, we rewrite $F_m(\Theta_m, \tilde{\psi}_m)$ as follows:


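$$\begin{aligned}
F_m(\Theta_m, \tilde{\psi}_m) &= \sum_{n=t_m}^{t_{m+1}-1} \tilde{q}_{\Theta_m}(s_n, a_n) \frac{\nabla\mu_{\Theta_m}(s_n, a_n)}{\mu_{\Theta_m}(s_n, a_n)} \\
&= \sum_{n=t_m}^{t_{m+1}-1} \frac{\nabla\mu_{\Theta_m}(s_n, a_n)}{\mu_{\Theta_m}(s_n, a_n)} \sum_{k=n}^{t_{m+1}-1} \big( R(s_k, a_k) - \tilde{\psi}_m \big) \\
&= \sum_{k=t_m}^{t_{m+1}-1} \big( R(s_k, a_k) - \tilde{\psi}_m \big) z_{k+1},
\end{aligned} \tag{19}$$

where

$$z_{k+1} = \begin{cases} \dfrac{\nabla\mu_{\Theta_m}(s_k, a_k)}{\mu_{\Theta_m}(s_k, a_k)}, & \text{if } k = t_m, \\[2ex] z_k + \dfrac{\nabla\mu_{\Theta_m}(s_k, a_k)}{\mu_{\Theta_m}(s_k, a_k)}, & k = t_m + 1, \ldots, t_{m+1} - 1. \end{cases}$$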
The details of the algorithm are given in Algorithm 2,
where $\eta$ is a positive scalar and $\gamma_k$ is the step size
of the algorithm.


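Algorithm 2 Algorithm to update $\Theta$ at every time step

At a typical time $k$, the state is $s_k$, and the values of $\Theta_k$, $z_k$,
and $\tilde{\psi}_k$ are available from the previous iteration. We
update $z$, $\Theta$, and $\tilde{\psi}$ according to:

$$z_{k+1} = \begin{cases} \dfrac{\nabla\mu_{\Theta_k}(s_k, a_k)}{\mu_{\Theta_k}(s_k, a_k)}, & \text{if } s_k = s^*, \\[2ex] z_k + \dfrac{\nabla\mu_{\Theta_k}(s_k, a_k)}{\mu_{\Theta_k}(s_k, a_k)}, & \text{otherwise,} \end{cases}$$

$$\Theta_{k+1} = \Theta_k + \gamma_k \big( R(s_k, a_k) - \tilde{\psi}_k \big) z_{k+1},$$

$$\tilde{\psi}_{k+1} = \tilde{\psi}_k + \eta \gamma_k \big( R(s_k, a_k) - \tilde{\psi}_k \big).$$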
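
A minimal, self-contained sketch of the per-step update is given below. It is not the authors' MATLAB implementation. The rates, rewards, battery capacity, harvesting probability, initial parameters (1, 1, 1), and initial average-reward estimate 0.7 come from the experiment setup; everything else is an assumption made for illustration: the event labels, the choice of the recurrent state (empty battery with an energy arrival), the step size 1/(1000 + k), the scalar eta = 1, and the rule that a request accepted at an empty battery earns nothing. For the logistic policy with slope 1.5, the likelihood-ratio term with respect to the class parameter is -1.5(1 - p) when the request is accepted and 1.5p when it is rejected, where p is the acceptance probability.

```python
import math
import random

# Values from the experiment setup; the labels are illustrative.
RATES = {"balloon": 60.0, "ground": 70.0, "satellite": 10.0, "energy": 110.0}
REWARDS = {"balloon": 5.0, "ground": 2.0, "satellite": 3.0}
CAPACITY, P_HARVEST, SLOPE = 10, 0.9, 1.5
S_STAR = (0, "energy")  # assumed recurrent state


def logistic(v):
    """Numerically stable 1 / (1 + exp(-v))."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    ev = math.exp(v)
    return ev / (1.0 + ev)


def run(steps, seed=0, eta=1.0, psi0=0.7):
    """Simulate the system and update theta and psi at every event."""
    rng = random.Random(seed)
    events, weights = list(RATES), list(RATES.values())
    theta = {c: 1.0 for c in REWARDS}   # initial parameter vector (1, 1, 1)
    z = {c: 0.0 for c in REWARDS}       # trace z_k, one entry per class
    psi, energy = psi0, CAPACITY
    for k in range(1, steps + 1):
        x = rng.choices(events, weights)[0]   # uniformized event draw
        if (energy, x) == S_STAR:
            z = {c: 0.0 for c in REWARDS}     # restart the trace at s*
        reward = 0.0
        if x == "energy":
            if energy < CAPACITY and rng.random() < P_HARVEST:
                energy += 1
        else:
            p = logistic(SLOPE * (energy - theta[x]))   # accept probability
            accept = rng.random() < p
            # Likelihood ratio grad(mu)/mu for the sampled action.
            z[x] += -SLOPE * (1.0 - p) if accept else SLOPE * p
            if accept and energy >= 1:
                energy -= 1
                reward = REWARDS[x]
        gamma = 1.0 / (1000.0 + k)   # assumed diminishing step size
        delta = reward - psi
        for c in theta:
            theta[c] += gamma * delta * z[c]
        psi += eta * gamma * delta
    return theta, psi


theta, psi = run(20_000)
print(theta, psi)
```

The sketch makes no claim about reproducing the reported numbers; convergence speed and the learned parameters depend strongly on the assumed step size and on how often the chosen recurrent state is visited.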


IV. Numerical Results 
A. Experiment Setup 

In this section, we perform simulations using MATLAB to
evaluate the performance of the proposed learning algorithm.
In the experiment, we consider the scenario depicted in
Fig. 2. The maximum battery capacity (energy queue size) is set
at 10 units. There are three classes of users, namely, class-1,
class-2, and class-3, corresponding to requests from balloons,
users on the ground, and satellites/aircraft, respectively. The
arrival rates of requests from users of class-1, class-2, and
class-3 are 60, 70, and 10 requests per hour, respectively. When
a request is accepted, the access point uses one unit of energy
from the battery to serve the request (i.e., to receive the
data and transmit it to the destination). Upon accepting
requests, the access point receives rewards of 5, 2, and 3
monetary units for class-1, class-2, and class-3, respectively.
The energy arrival rate is 110 per hour and the successful
energy harvesting probability is 0.9. If the balloon harvests
energy successfully and the battery is not full, the battery
level increases by one unit. For the learning algorithm, the initial
parameter vector is set at $\Theta = (\theta_1, \theta_2, \theta_3) = (1, 1, 1)$, and the
initial estimated average reward is 0.7.

Fig. 3: The convergence and the policy of the learning algorithm.


B. Numerical Results 

We first consider the convergence of the proposed learning
algorithm (i.e., Algorithm 2). Figures 3a and 3b show the
convergence in terms of the average reward and the parameter
vector, respectively. In both figures, the proposed learning
algorithm converges within around $5 \times 10^5$ to $10^6$ iterations.
In Fig. 3a, we also compare the average rewards obtained by the
learning algorithm and the greedy policy that always accepts
requests. At the convergence point, the average reward obtained
by the learning algorithm reaches approximately 1.48, which is
8.8% higher than that obtained by the greedy policy.
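
For reference, the greedy baseline is simple to simulate. The sketch below estimates its average reward per uniformized event under the parameters of the experiment setup, assuming that one unit of energy is spent per accepted request and that a request arriving at an empty battery cannot be served. The function name and event labels are hypothetical, and the figure it prints is a per-event average that need not match the units or the exact values plotted in the paper.

```python
import random

# Values from the experiment setup; the labels are illustrative.
RATES = {"balloon": 60.0, "ground": 70.0, "satellite": 10.0, "energy": 110.0}
REWARDS = {"balloon": 5.0, "ground": 2.0, "satellite": 3.0}


def greedy_average_reward(steps, capacity=10, p_harvest=0.9, seed=0):
    """Average reward per event when every servable request is accepted."""
    rng = random.Random(seed)
    events, weights = list(RATES), list(RATES.values())
    energy, total = capacity, 0.0
    for _ in range(steps):
        x = rng.choices(events, weights)[0]   # uniformized event draw
        if x == "energy":
            if energy < capacity and rng.random() < p_harvest:
                energy += 1
        elif energy >= 1:
            # Greedy: accept whenever the battery can serve the request.
            energy -= 1
            total += REWARDS[x]
    return total / steps


avg = greedy_average_reward(200_000)
print(round(avg, 3))
```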


In Fig. 3b, the parameter vector $\Theta$ converges to (-1.5577,
4.3448, 1.7029) for class-1, class-2, and class-3, respectively.
From the parameter vector obtained by the learning
algorithm, we can determine the policy for the access point, as
shown in Fig. 3c. In Fig. 3c, the requests from other balloons
are always accepted, while the requests from users on
the ground and satellites are accepted only when the
energy level in the battery is high enough. In particular, the
access point accepts the requests from users on the ground
and from satellites when the energy level is higher than 5 units
and 2 units, respectively.

We then evaluate the impact of the battery capacity on the
performance of the system. Specifically, in Fig. 4, we vary the
maximum battery capacity and observe its impact on the average
number of accepted requests and the average reward of the
access point. When the maximum battery capacity increases,
the average reward and the average number of accepted requests
obtained by the learning algorithm (LA) and the greedy policy
(GP) increase, and they saturate when the maximum battery
capacity is greater than 15. Interestingly, when
the maximum battery capacity increases from 5 to 15, the
average number of requests accepted by the learning algorithm
is lower than that of the greedy policy, yet the average
reward obtained by the learning algorithm is higher than that
of the greedy policy. The reason can be seen by comparing the
policy of the learning algorithm with the greedy policy,
as shown in Fig. 5. While the greedy policy always accepts
requests if the battery is not empty, the learning algorithm
accepts requests from class-2 and class-3 only when the
energy level is high enough. It is also worth noting that, when
the maximum battery capacity is greater than 15, the average
numbers of requests accepted by the learning algorithm and
the greedy policy are equal, whereas the average reward
obtained by the learning algorithm is always greater than that
of the greedy policy. The reason is that when the battery
capacity is small, the amount of energy harvested is
limited, and thus the learning algorithm accepts requests that
yield a high reward and rejects requests that yield a low reward.
When the battery capacity increases, the amount of energy
harvested increases, and thus there is more chance for the
requests with a low reward to be accepted (as shown in Fig. 5).
However, when the battery capacity is greater than 15, the
performance of the system saturates. The reason is that
the number of accepted requests depends not only on the battery
capacity, but also on the energy arrival rate. In other words,
the system performance is constrained by the energy arrival
rate if the battery capacity is large enough.

Fig. 4: The performance of the system when the maximum
queue size is varied.

Fig. 5: The average number of accepted requests.


We fix the battery capacity at 10 and vary the energy
arrival rate. When the energy arrival rate increases from 90
to 130, the probability of accepting requests under both the
greedy policy and the learning algorithm increases. As a result,
the average reward as well as the average energy harvested under
both policies also increase. Moreover, as shown in Fig. 6,
when the energy arrival rate is small, the average reward and
the average energy in the battery with the learning algorithm
are much greater than those of the greedy policy. However,
as the energy arrival rate increases, the performance gap
between the learning algorithm and the greedy policy becomes
smaller. Eventually, when the energy arrival rate is large, the
results obtained by the greedy policy approach those of
the learning algorithm.


V. Summary 

In this paper, we have studied and developed an optimization
model for the optimal energy control problem of
a network in the sky. The aim is to maximize the network
performance as well as the profit of the network provider.
We first formulated the problem as a Markov decision
process and then applied an online learning algorithm based
on the policy gradient method to obtain the optimal policy for
the access point deployed on a balloon. Numerical results
have been presented to show the impact of the parameters on the
system performance, as well as the convergence and
the efficiency of the proposed learning algorithm.

Fig. 6: The performance of the system when the energy arrival
probability is varied.